Accelerated Clustering Through Locality-Sensitive Hashing by Shaunak Kishore
Abstract
We obtain improved running times for two algorithms for clustering data: the expectation-maximization (EM) algorithm and Lloyd's algorithm. The EM algorithm is a heuristic for finding a mixture of $k$ normal distributions in $\mathbb{R}^d$ that maximizes the probability of drawing $n$ given data points. Lloyd's algorithm is a special case of this algorithm in which the covariance matrix of each normally distributed component is required to be the identity. We consider versions of these algorithms where the number of mixture components is inferred by assuming a Dirichlet process as a generative model. The separation probability of this process, $\alpha$, is typically a small constant. We speed up each iteration of the EM algorithm from $O(nd^2 k)$ to $O(ndk \log^3(k/\alpha) + nd^2)$ time and each iteration of Lloyd's algorithm from $O(ndk)$ to $O(nd(k/\alpha)^{0.39})$ time.
Thesis Supervisor: Jonathan A. Kelner
Title: Assistant Professor
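For orientation, the baseline cost quoted in the abstract for Lloyd's algorithm, O(ndk) per iteration, can be seen in a minimal sketch of one unaccelerated iteration. This is not the thesis's accelerated method and uses no locality-sensitive hashing; the function names `squared_distance` and `lloyd_iteration` are hypothetical, and the number of clusters k is assumed fixed here rather than inferred from a Dirichlet process.

```python
from typing import List, Sequence

Point = Sequence[float]


def squared_distance(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two d-dimensional points: O(d)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_iteration(points: List[Point], centers: List[Point]) -> List[List[float]]:
    """One plain Lloyd iteration (illustrative baseline, not the thesis's algorithm)."""
    k = len(centers)
    d = len(centers[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k

    # Assignment step: every point is compared with every center,
    # so the work is n points * k centers * d coordinates = O(ndk).
    for p in points:
        nearest = min(range(k), key=lambda j: squared_distance(p, centers[j]))
        counts[nearest] += 1
        for i in range(d):
            sums[nearest][i] += p[i]

    # Update step: each center moves to the mean of its assigned points.
    # Assumption: a center with no assigned points is left where it is.
    new_centers = []
    for j in range(k):
        if counts[j] == 0:
            new_centers.append(list(centers[j]))
        else:
            new_centers.append([s / counts[j] for s in sums[j]])
    return new_centers
```

The exhaustive nearest-center search in the assignment step is the part that dominates the O(ndk) cost, and it is the kind of step that an approximate nearest-neighbor structure such as locality-sensitive hashing is designed to speed up.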
Similar resources
Accelerated similarity searching and clustering of large compound sets by geometric embedding and locality sensitive hashing
MOTIVATION Similarity searching and clustering of chemical compounds by structural similarities are important computational approaches for identifying drug-like small molecules. Most algorithms available for these tasks are limited by their speed and scalability, and cannot handle today's large compound databases with several million entries. RESULTS In this article, we introduce a new algori...
Kernelized Locality-Sensitive Hashing for Semi-Supervised Agglomerative Clustering
Large scale agglomerative clustering is hindered by computational burdens. We propose a novel scheme where exact inter-instance distance calculation is replaced by the Hamming distance between Kernelized Locality-Sensitive Hashing (KLSH) hashed values. This results in a method that drastically decreases computation time. Additionally, we take advantage of certain labeled data points via distanc...
Hierarchical clustering of large text datasets using Locality-Sensitive Hashing
In this paper, we present a hierarchical clustering algorithm for large text datasets using Locality-Sensitive Hashing (LSH). The main idea of LSH is to “hash” items several times, in such a way that similar items are more likely to be hashed to the same bucket than dissimilar items are. The main drawback of the conventional hierarchical algorithms is a large time complexity (e.g. Single Linka...
High-Throughput, Web-Scale Data Stream Clustering
Clustering is an important technique for analysing and interpreting massive quantities of data present on the web. However, the sheer volume of data, along with its often dynamic and fast-changing nature, poses a challenge for traditional clustering approaches. We present a parallel clustering system specifically designed for continuous, real-time clustering of web-scale message data streams. A...
Publication date: 2013